FIGURE 5.15
Overview of BiBERT, applying Bi-Attention structure for maximizing representation information and Direction-Matching Distillation (DMD) scheme for accurate optimization.

TABLE 5.7
Quantization results of BiBERT on the GLUE benchmark. The average result over all tasks is reported.

Method         #Bits        Size (MB)   GLUE
BERT-base      full-prec.   418         82.84
BinaryBERT     1-1-4        16.5        79.9
TernaryBERT    2-2-2        28.0        45.5
BinaryBERT     1-1-2        16.5        53.7
TernaryBERT    2-2-1        28.0        42.3
BinaryBERT     1-1-1        16.5        41.0
BiBERT         1-1-1        13.4        63.2
BERT-base6L    full-prec.   257         79.4
BiBERT6L       1-1-1        6.8         62.1
BERT-base4L    full-prec.   55.6        77.0
BiBERT4L       1-1-1        4.4         57.7

In summary, this paper's contributions are: (1) the first work to explore fully binarized pre-trained BERT models; (2) an efficient Bi-Attention structure for maximizing representation information statistically; (3) a Direction-Matching Distillation (DMD) scheme to accurately optimize the fully binarized BERT.
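The PyTorch sketch below conveys both ideas at a high level. It is a minimal, illustrative sketch rather than the authors' implementation: the bool() function with a straight-through estimator replaces softmax over binarized attention scores (the core idea of Bi-Attention), and a cosine-style loss matches the direction of teacher and student activations (the spirit of DMD, which in BiBERT is applied to matrices constructed from queries, keys, and values rather than raw activations). The function names, the per-tensor scaling, and the STE clipping range are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BoolSTE(torch.autograd.Function):
    """bool(x): 1 where x >= 0, 0 elsewhere; gradients are passed
    straight through inside |x| <= 1 (an assumed STE choice)."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return (x >= 0).to(x.dtype)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).to(grad_out.dtype)

def binarize(x):
    # Sign binarization with a per-tensor scaling factor (illustrative).
    return torch.sign(x) * x.abs().mean()

def bi_attention(q, k, v):
    """Bi-Attention-style head: binary {0, 1} attention weights obtained
    by applying bool() to binarized query/key scores instead of softmax."""
    d = q.shape[-1]
    scores = binarize(q) @ binarize(k).transpose(-2, -1) / d ** 0.5
    attn = BoolSTE.apply(scores)        # binary attention weights
    return attn @ binarize(v)           # binarized value aggregation

def dmd_loss(student_acts, teacher_acts):
    """Direction-matching-style distillation: penalize the angle between
    student and teacher features rather than their absolute values."""
    loss = 0.0
    for s, t in zip(student_acts, teacher_acts):
        s_dir = F.normalize(s.flatten(1), dim=-1)
        t_dir = F.normalize(t.flatten(1), dim=-1)
        loss = loss + (1.0 - (s_dir * t_dir).sum(dim=-1)).mean()
    return loss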

5.10 BiT: Robustly Binarized Multi-Distilled Transformer

Liu et al. [156] further presented BiT to boost the performance of fully binarized pre-trained BERT models. In their work, they identified a series of improvements that enable binary BERT, including a two-set binarization scheme, an elastic binary activation function with learned parameters, and a method to quantize a network to its limit by successively